Deriving TF-IDF as a Fisher Kernel
Abstract
The Dirichlet compound multinomial (DCM) distribution has recently been shown to be a good model for documents because it captures the phenomenon of word burstiness, unlike standard models such as the multinomial distribution. This paper investigates the DCM Fisher kernel, a function for comparing documents derived from the DCM. We show that the DCM Fisher kernel has components that are similar to the term frequency (TF) and inverse document frequency (IDF) factors of the standard TF-IDF method for representing documents. Experiments show that the DCM Fisher kernel performs better than alternative kernels for nearest-neighbor document classification, but that the TF-IDF representation still performs best.
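As a rough illustration of the TF-IDF baseline and the nearest-neighbor setup mentioned above, here is a minimal sketch. The abstract does not say which TF-IDF variant was used, so the weighting below, log(1 + tf) * log(N / df), is an assumption, and the function names and toy documents are hypothetical. It does not implement the DCM Fisher kernel itself.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {word: weight} vector.

    Assumed weighting (not taken from the paper): log(1 + tf) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {w: math.log(1 + c) * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor_label(query_vec, train_vecs, train_labels):
    """Return the label of the training vector most similar to the query."""
    best = max(range(len(train_vecs)), key=lambda i: cosine(query_vec, train_vecs[i]))
    return train_labels[best]


# Toy data (hypothetical): two labeled documents and one query document.
toy_docs = [
    ["kernel", "fisher", "kernel"],
    ["speech", "topic", "speech"],
    ["fisher", "kernel", "document"],
]
toy_vecs = tfidf_vectors(toy_docs)
predicted = nearest_neighbor_label(toy_vecs[2], toy_vecs[:2], ["ml", "speech"])
print(predicted)
```

The logarithm on the term count damps repeated occurrences of a word within a document, which is the usual way TF weighting accounts for the burstiness the abstract refers to.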
Similar resources
A Kernel for Interactive Document Retrieval Based on Support Vector Machines
This paper describes an application of support vector machines (SVMs) to interactive document retrieval using active learning. We show that an SVM-based retrieval has an association with conventional Rocchio-based relevance feedback by a comparative analysis. We propose a cosine kernel, which denotes cosine similarity, suitable for an SVM-based interactive document retrieval based on the analys...
Confidence-Based Techniques for Rapid and Robust Topic Identification of Conversational Telephone Speech
We investigate the impact of automatic speech recognition errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF feature-weighting calculation that provides significant robustness under various recognition error conditions. For our experiments we take conversations from the Fisher corpus to produce 1-best and lattice outputs using one reco...
A HowNet-based Semantic Relatedness Kernel for Text Classification
The exploitation of the semantic relatedness kernel has always been an appealing subject in the context of text retrieval and information management. Typically, in text classification the documents are represented in the vector space using the bag-of-words (BOW) approach. The BOW approach does not take into account the semantic relatedness information. To further improve the text classification...
Techniques for rapid and robust topic identification of conversational telephone speech
In this paper, we investigate the impact of automatic speech recognition (ASR) errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF feature weighting calculation that provides significant robustness under various recognition error conditions. For our experiments we take conversations from the Fisher corpus to produce 1-best and lattice ...
A Knowledge-Based Semantic Kernel for Text Classification
Typically, in textual document classification the documents are represented in the vector space using the “Bag of Words” (BOW) approach. Despite its ease of use, the BOW representation cannot handle word synonymy and polysemy problems and does not consider semantic relatedness between words. In this paper, we overcome the shortcomings of the BOW approach by embedding a known WordNet-based semantic re...